A multi-modal attention neural network for traffic flow prediction by capturing long-short term sequence correlation

Accurate traffic flow prediction information can help traffic managers and drivers make more rational decisions and choices. To make an effective and accurate traffic flow prediction, we need to consider not only the spatio-temporal dependencies between data, but also the temporal correlation between data. However, most existing methods only consider temporal continuity and ignore temporal correlation. In this paper, we propose a multi-modal attention neural network for traffic flow prediction by capturing long-short term sequence correlation (LSTSC). In the model, we employed attention mechanisms to capture the spatio-temporal correlations of the sequences, and the model based on multiple decision forms demonstrated higher accuracy and reliability. The superiority of the model is demonstrated on two datasets, PeMS08 and PeMSD7(M), particularly for long-term predictions.

Over the past few decades, an increasing number of private cars have brought a series of problems such as traffic congestion and parking difficulties. Accurate traffic prediction can provide a powerful decision-making basis for traffic managers and enable drivers to choose smoother roads for travel 1. Traditional machine learning is often used to predict traffic flow; for example, Cai et al. 2 proposed k-nearest neighbor (KNN) for traffic flow prediction. However, the KNN algorithm is a distance-based method that assumes linear relationships between samples. In traffic flow prediction, the relationship between past and future flow is often nonlinear, making the KNN algorithm less suitable for effectively fitting the data. With the development of deep neural networks, methods such as long short-term memory (LSTM) and gated recurrent unit (GRU) have emerged to handle temporal dependency 3,4, while methods such as convolutional neural network (CNN) and graph convolution network (GCN) have emerged to handle spatial dependency 5,6. Other approaches, such as Medrano et al. 7 using attention mechanisms and Zhang et al. 8 leveraging graph convolution, have also been effective in predicting traffic flow. However, these prediction methods still have three limitations.

Fixed spatial dependency
In the traffic road network, the traffic flows at different nodes often affect and correlate with each other. As shown in Fig. 1, the traffic flow at node A may be influenced by nodes C and D. This influence may even change over time; for example, the correlation between an industrial zone and a residential area may be stronger on workdays but weaker on non-workdays. Therefore, when conducting traffic flow prediction, it is necessary to capture this dynamic spatial dependency.

Limited-range temporal dependency
For a certain node, different historical traffic flows may have different impacts on the current traffic flow at that node. As shown in Fig. 2, the traffic flow at node A at time $t_l$ may have a weak dependency on the traffic flow at time $t_{l-n}$, but a strong dependency on the traffic flow at time $t_{l-n-1}$. Therefore, capturing this type of nonlinear and highly dynamic long temporal dependency is also one of the key points of traffic flow prediction.
The main contributions of this paper are summarized as follows.
• We propose a method for capturing long-short term sequence correlation that discovers the relationship between the traffic flows of neighboring time spans.
• We develop a multi-modal attention framework by fusing the periodicity and temporal sequence correlation for traffic flow prediction.
• We evaluate the LSTSC model on two real-world datasets, and the experimental results demonstrate that the LSTSC model outperforms the baseline algorithms.
The structure of this work is summarized as follows. The "Related work" section reviews related work on traffic flow prediction. The definition and notation of traffic flow are given in the "Definition and notation" section. The general framework of the proposed model is presented in the "A multi-modal attention neural network" section. The "Experiment and result analysis" section presents the results of the model. Finally, the paper is concluded in the "Conclusion" section.

Related work
In this section, we review traffic flow prediction methods based on CNN, on graph neural networks, and on attention mechanisms.

Traffic prediction methods based on CNN
The CNN is one of the most important classical structures in deep learning and is often used to solve the traffic prediction problem. Yang et al. 11 classified traffic data according to proximity (short-term characteristics), periodicity and trend (long-term characteristics), and mapped them into a two-dimensional space composed of time and space. The high-level spatio-temporal features learned by the CNN from matrices with different time lags are further fused with external factors through a logistic regression layer to obtain the final prediction. Zhang et al. 12 used the spatio-temporal feature selection algorithm (STFSA) to determine the optimal input data time delay and spatial data amount, extracted the selected spatio-temporal traffic flow features from the actual data, and converted them into a two-dimensional matrix. A CNN is then used to learn these features to build a prediction model. Cao et al. 13 proposed a traffic speed prediction model based on CNN and LSTM. First, a CNN was used to extract the daily and weekly periodicity characteristics of traffic speed in the target area, and the spatio-temporal characteristics of the CNN output were then extracted through the LSTM layer. Ma et al. 14 used the nonlinear fitting ability of CNN to extract deep features from the convolutional layer and pooling layer for model training. Yu et al. 15 used 3D convolutional kernels to simultaneously extract and fuse spatio-temporal features in traffic flow data to ensure that temporal information is treated as spatial information in all network layers.
Although CNNs can capture the spatial dependencies in traffic flow prediction, the topology of traffic networks is typically irregular, whereas traditional CNNs are better suited to regular grid-like data, which makes irregular data challenging to handle.

Traffic prediction methods based on graph neural networks
The GCN model acts as a feature extractor just like a CNN, except that it works on graphs. Zhao et al. 16 proposed a temporal GCN, which combined GCN and GRU. In simple terms, for complex topologies of traffic data, GCN can be used to capture spatial dependency and GRU to capture temporal dependency. Ali et al. 17 combined GCN based on LSTM with previously published models to capture spatial patterns and short-time temporal features of images. Chen et al. 18 proposed a novel location-graph convolutional network (Location-GCN). Location-GCN adds a new learnable matrix to the GCN mechanism, and uses the absolute value of the matrix to represent the different degrees of influence between different nodes. Peng et al. 19 used the dynamic traffic flow probability graph to model the traffic network, performed graph convolution on the dynamic graph to learn the spatial features of the data, and combined it with the LSTM unit to learn the temporal features of the data. Tang et al. 20 adjusted the graph convolutional network based on spatial correlation to extract the spatial features of the road network.
LSTM was designed to address the issue of short-term time dependencies in traditional RNN. However, in excessively long sequences, problems of gradient vanishing or exploding can still arise. Gradient vanishing prevents the model from learning long-term dependencies, while gradient exploding leads to numerical overflow, causing instability in network training.

Traffic prediction methods based on attention mechanism
The attention mechanism is a commonly used module in deep learning. As a resource allocation scheme, it uses limited computing resources to process the more important information, which is the main means of solving the problem of information overload. Liao et al. 21 proposed an improved dynamic Chebyshev GCN model. In this method, an attention-based Laplacian matrix update method is proposed, which approximately constructs features from data of different periods. Wang et al. 22 provided a learnable location attention mechanism that can effectively aggregate the information of neighboring roads. Yin et al. 23 designed an internal attention mechanism to capture the temporal dependency, and in addition used adjacency as a prior to divide the nodes in the road network into different neighborhood sets. In this way, attention can dynamically capture spatial dependency within and between same-order neighborhoods. Zheng et al. 24 designed a Conv-LSTM model based on the attention mechanism. The attention mechanism in the model distinguishes the importance of different time stream sequences by automatically assigning different weights. Inspired by the role of the attention mechanism in regulating information flow, Wei et al. 25 embedded the attention mechanism into GRU and LSTM recurrent modules in an attempt to focus on the important information of internal features.
Although the introduction of the attention mechanism has addressed some deficiencies in previous traffic flow models, these attention-based models still lack the capture of temporal correlation, meaning they do not capture the association between future and past data.Inspired by these studies, we use attention mechanisms and CNN to capture spatio-temporal dependencies and temporal correlation separately.

Definition and notation
In this section, we give the definitions and notation related to traffic flow forecasting.

Temporal dependency and temporal correlation
As shown in Fig. 3, temporal correlation is defined as follows: the historical traffic flows in two adjacent time slots are observed, and CNNs are used to capture the temporal correlations in the data. In the defining equation, Conv represents the CNN, while $X_{t_1\sim t_n}$ and $X_{t_{1+n}\sim t_{2n}}$ represent two adjacent historical traffic flow data segments in the time dimension. In order to comprehensively capture the spatio-temporal information in the data, we introduce the channel dimension $D$. This allows the model to leverage the traffic flow information from multiple channels, integrating data from different channels to provide more comprehensive and accurate data features as output.
Temporal dependency is defined as follows: the continuous historical traffic flows in a time interval are observed, and an attention mechanism is used to capture the temporal dependencies between the data. In the defining equation, Att represents the attention mechanism.
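Under the notation of this section, the two definitions can be sketched compactly as below. This is an illustrative form only: the output symbol $\mathit{Dep}$, the interval $t_1\sim t_{2n}$ passed to Att, and the exact argument layout are assumptions made for the sketch, not the paper's numbered equations.

```latex
\begin{aligned}
C            &= \mathrm{Conv}\left(X_{t_1\sim t_n},\; X_{t_{1+n}\sim t_{2n}}\right), \\
\mathit{Dep} &= \mathrm{Att}\left(X_{t_1\sim t_{2n}}\right).
\end{aligned}
```

The contrast the sketch is meant to show is that Conv takes two adjacent segments and relates them to each other, whereas Att operates within a single continuous interval.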

Traffic flow prediction problem
$T^{d}_{w}$, $T^{d+3}_{w}$ and $T^{d+6}_{w}$ respectively represent the $d$th day, $(d+3)$th day and $(d+6)$th day of the $w$th week, and $T^{d}_{w+1}$ represents the $d$th day of the $(w+1)$th week. We collect the traffic flow data $X^{d}_{w;t_1\sim t_{2n}}$, $X^{d+3}_{w;t_1\sim t_{2n}}$ and $X^{d+6}_{w;t_1\sim t_{2n}}$ for time slots $t_1\sim t_{2n}$ on days $T^{d}_{w}$, $T^{d+3}_{w}$ and $T^{d+6}_{w}$, as well as the traffic flow data $X^{d}_{w+1;t_1\sim t_n}$ for time slots $t_1\sim t_n$ on day $T^{d}_{w+1}$, as historical traffic flow data, and predict the traffic flow data for time slots $t_{n+1}\sim t_{2n}$ on day $T^{d}_{w+1}$. In the equation that expresses traffic flow prediction, $C$ represents the temporal correlation of historical traffic flow data, $\hat{X}^{d}_{w+1;t_{n+1}\sim t_{2n}}$ represents the predicted traffic flow data for time slots $t_{n+1}\sim t_{2n}$ on day $T^{d}_{w+1}$, $P$ represents the periodicity of historical traffic flow data, and $M_1$ and $M_2$ represent the respective components of the traffic flow prediction model. The periodicity $P$ refers to the occurrence or variation of similar events, phenomena, or patterns at the same time intervals every week.
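The way the historical inputs and the prediction target are selected can be illustrated with a minimal sketch. The dictionary layout keyed by `(week, day)` and the function name `build_sample` are hypothetical choices made for this example; each entry is assumed to hold the $2n$ readings $t_1\sim t_{2n}$ of one sensor on that day.

```python
from typing import Dict, List, Tuple

Day = Tuple[int, int]  # (week index w, day index d); hypothetical key layout


def build_sample(series: Dict[Day, List[float]], w: int, d: int, n: int):
    """Assemble the historical inputs and the prediction target for one sample.

    series[(w, d)] is assumed to hold the 2n readings t_1..t_2n of that day.
    """
    # Full 2n-step windows on days d, d+3 and d+6 of week w.
    periodic = [series[(w, d + k)][: 2 * n] for k in (0, 3, 6)]
    today = series[(w + 1, d)]
    recent = today[:n]          # t_1..t_n on day d of week w+1 (observed)
    target = today[n: 2 * n]    # t_{n+1}..t_2n on the same day (to be predicted)
    return periodic, recent, target


# Toy data with n = 2, i.e. four readings per day.
toy = {
    (0, 0): [1.0, 2.0, 3.0, 4.0],
    (0, 3): [5.0, 6.0, 7.0, 8.0],
    (0, 6): [9.0, 10.0, 11.0, 12.0],
    (1, 0): [13.0, 14.0, 15.0, 16.0],
}
periodic, recent, target = build_sample(toy, w=0, d=0, n=2)
```

In this toy call the model would see three full days from week 0 plus the first half of the current day, and would be asked to predict the second half of the current day.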

A multi-modal attention neural network
The overall framework of the model is shown in Fig. 4. The model first takes all historical traffic flow data as input, which includes the characteristics of traffic flow in time and space. These data are fed into the spatial-temporal transformer (STTN) module, whose goal is to extract the dynamic spatial dependencies and long-term temporal dependencies from the data. In other words, this module analyzes historical traffic flow data to identify patterns and trends in traffic, which are important information for predicting future traffic flow. Next, building on these spatio-temporal dependencies, the model uses a CNN to capture the long-term and short-term temporal correlation of the historical traffic flow data. By analyzing this data, the CNN can identify traffic flow patterns that vary over time, which is crucial for predicting future traffic flow. Then, the model combines the long-term and short-term temporal correlation information extracted by the CNN with the historical traffic flow data. Finally, the model integrates periodic information into the prediction to avoid errors caused by a single decision, because considering the periodic nature of traffic flow (such as different traffic volumes on weekdays and weekends) is very helpful for improving prediction accuracy.

Long-term temporal correlation and short-term temporal correlation modules
For the long-term temporal correlation module, the traffic flow data for time slots $t_1\sim t_n$ and $t_{n+1}\sim t_{2n}$ on day $T^{d+3}_{w}$ and the traffic flow data for the same time slots on day $T^{d+6}_{w}$ are spliced on the channel dimension, segment by segment. For the segment $t_{n+1}\sim t_{2n}$, the equation is as follows:
$$X^{d+3,d+6}_{w;t_{n+1}\sim t_{2n}} = \left[X^{d+3}_{w;t_{n+1}\sim t_{2n}},\; X^{d+6}_{w;t_{n+1}\sim t_{2n}}\right], \qquad (1)$$
where the elements inside the brackets $[\cdot]$ are concatenated in a matrix format. The segment $t_1\sim t_n$ is spliced in the same way.
After splicing, the data dimension is reduced through a fully connected layer. The result is then fed into the spatio-temporal dependency and correlation capture module (STDCCM) to obtain the spatio-temporal characteristics of the traffic flow data, and the long-term temporal correlation of the data is extracted. For the short-term temporal correlation module, the spatio-temporal characteristics of the traffic flow data for time slots $t_1\sim t_{2n}$ on day $T^{d}_{w}$ are extracted first, and the short-term temporal correlation is then captured directly. Short-term and long-term temporal correlation are treated separately because their importance may vary across time steps.
The STDCCM is shown in Fig. 5. This module consists of two parts: the STTN, which is used to extract spatio-temporal dependencies, and a CNN, which is used to capture temporal correlation. It first captures the spatio-temporal features of the traffic flow data, and then extracts the temporal correlation.
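The channel-wise splicing of Eq. (1) can be sketched in a few lines. The nested-list layout `[time step][node][channel]` and the helper name `concat_channels` are assumptions made for this illustration; the subsequent fully connected layer that reduces the dimension is not shown.

```python
from typing import List

Tensor3 = List[List[List[float]]]  # [time step][node][channel]


def concat_channels(a: Tensor3, b: Tensor3) -> Tensor3:
    """Concatenate two [T][N][D] tensors along the channel axis, giving [T][N][2D]."""
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        raise ValueError("time and node dimensions must match")
    # For each time step and node, append the channels of b to those of a.
    return [[ca + cb for ca, cb in zip(ra, rb)] for ra, rb in zip(a, b)]


# One time step, two nodes, one channel per day.
day_d3 = [[[1.0], [2.0]]]
day_d6 = [[[3.0], [4.0]]]
spliced = concat_channels(day_d3, day_d6)
```

After splicing, every node carries the readings of both days side by side in its channel vector, which is why a dimension-reducing layer follows.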
The STTN is composed of a spatial transformer and a temporal transformer. The key idea of the spatial transformer is to assign different weights to different data points (such as sensors) at different time steps, as shown in Fig. 6, where $a_{i,j}$ represents the attention weight between node $i$ and node $j$ at the same time instant. The spatial transformer is composed of two parts, a GCN and an attention mechanism, as shown in Fig. 7. In the attention mechanism, the query subspace is spanned by $Q^S \in \mathbb{R}^{N\times d_c}$, the key subspace by $K^S \in \mathbb{R}^{N\times d_c}$ and the value subspace by $V^S \in \mathbb{R}^{N\times d_c}$. $D$ is the channel dimension, and $h$ is the number of heads in multi-head attention.
Attention scores $S^S \in \mathbb{R}^{N\times N}$ between nodes are calculated from the dot product of $Q^S$ and $K^S$. Dynamic spatial dependencies $S^{t}_{1} \in \mathbb{R}^{N\times d_c}$ can then be obtained based on the attention scores, the value subspace, and the residual network. A feed-forward network is included to enhance the model's expressive capacity and non-linear modeling capabilities, where $W^S_0$, $W^S_1$ and $W^S_2$ are the weight matrices for its three layers. The dynamic spatial dependencies and static spatial dependencies are then fused, where $f_1$ and $f_2$ represent linear projections that convert $S^{f}_{1}$ and $S^{t}_{1}$ into one-dimensional vectors.
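The attention step described above can be illustrated with a minimal single-head sketch in plain Python. It assumes the standard scaled dot-product form from the Transformer literature: the division by $\sqrt{d_c}$ and the row-wise softmax are assumptions, and the residual connection, feed-forward network and gated fusion with the GCN branch are left out. All function and parameter names are hypothetical.

```python
import math
from typing import List, Tuple

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of a [m][k] and b [k][n]."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a: Matrix) -> Matrix:
    return [list(col) for col in zip(*a)]


def softmax_rows(a: Matrix) -> Matrix:
    out = []
    for row in a:
        m = max(row)  # subtract the row maximum for numerical stability
        exps = [math.exp(v - m) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def spatial_attention(x: Matrix, w_q: Matrix, w_k: Matrix, w_v: Matrix) -> Tuple[Matrix, Matrix]:
    """x: [N][d] node features at one time step; w_*: [d][d_c] projections."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    d_c = len(q[0])
    scores = matmul(q, transpose(k))                                # [N][N]
    scores = [[s / math.sqrt(d_c) for s in row] for row in scores]  # assumed scaling
    weights = softmax_rows(scores)                                  # a_{i,j}, rows sum to 1
    mixed = matmul(weights, v)                                      # [N][d_c]
    return weights, mixed


# Three nodes with two features each; identity projections keep the toy case readable.
identity = [[1.0, 0.0], [0.0, 1.0]]
nodes = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights, mixed = spatial_attention(nodes, identity, identity, identity)
```

Because the weights are recomputed from the node features at every time step, the resulting $N\times N$ matrix changes with the data, which is what makes the captured spatial dependency dynamic rather than fixed.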
Finally, the results $Y^{s} \in \mathbb{R}^{N\times D}$ of the multi-head attention mechanism are fused together, where $W^S_3$ is the weight matrix. Through the multi-head attention mechanism, the model can simultaneously focus on different relationships and patterns, thus better capturing the diversity and complexity in the data. This helps improve the model's robustness and generalization, making it more effective and flexible in handling various types of input data. Additionally, the multi-head attention mechanism allows the model to attend to different feature interactions at different levels, enabling better extraction of high-level feature representations.
The key idea of the temporal transformer is to capture temporal dependency by assigning different weights to different time steps, as shown in Fig. 8. $b_{\alpha,\beta}$ represents the attention weight between the same node at two different time instants. Specifically, if node 1 exists at both time instants $\alpha$ and $\beta$, then $b_{\alpha,\beta}$ represents the attention weight between node 1 at time instant $\alpha$ and node 1 at time instant $\beta$. The temporal transformer is composed entirely of attention mechanisms, which can extract long temporal dependencies, as shown in Fig. 9. Here, the input to the temporal transformer is $X^{t} = Y^{s}$. As in the spatial transformer, temporal dependencies are dynamically computed in high-dimensional latent subspaces.
The process of the temporal transformer is similar, with three learnable matrices being defined: the query matrix $W^T_q \in \mathbb{R}^{d_c\times d_c}$, the key matrix $W^T_k \in \mathbb{R}^{d_c\times d_c}$, and the value matrix $W^T_v \in \mathbb{R}^{d_c\times d_c}$. The query subspace is spanned by $Q^T \in \mathbb{R}^{H\times d_c}$, the key subspace by $K^T \in \mathbb{R}^{H\times d_c}$ and the value subspace by $V^T \in \mathbb{R}^{H\times d_c}$, where $H$ represents the size of the predicted time, $D$ is the channel dimension, and $h$ is the number of heads in multi-head attention. Attention scores $S^T \in \mathbb{R}^{H\times H}$ between time steps are calculated from the dot product of $Q^T$ and $K^T$. A feed-forward network follows, where $W^T_0$, $W^T_1$ and $W^T_2$ are the weight matrices for its three layers. Finally, the results $Y^{t} \in \mathbb{R}^{H\times D}$ of the multi-head attention mechanism are fused together, where $W^T_3$ is the weight matrix.
The temporal correlation part is composed entirely of CNN: it first concatenates the traffic flow data on the time dimension and then obtains the temporal correlation through the CNN. $Y^{l}_{t_1}$ signifies the spatio-temporal dependency of $t_1\sim t_n$ within the long-term temporal correlation module, and $Y^{l}_{t_2}$ represents the spatio-temporal dependency of $t_{n+1}\sim t_{2n}$ within the same module. $Y^{s}_{t_1}$ corresponds to the spatio-temporal dependency of $t_1\sim t_n$ within the short-term temporal correlation module, and $Y^{s}_{t_2}$ indicates the spatio-temporal dependency of $t_{n+1}\sim t_{2n}$ within the same module.
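A rough sketch of the CNN-based temporal correlation step follows. It assumes that the time steps act as the input channels of a pointwise ($1\times 1$) convolution, so that two concatenated segments of $n$ steps each are mixed into a smaller number of output steps; this reading, the flat `[time step][node]` layout, and the names `conv1x1` and `temporal_correlation` are assumptions for illustration, not the paper's stated implementation.

```python
from typing import List

Matrix = List[List[float]]


def conv1x1(x: Matrix, kernel: Matrix, bias: List[float]) -> Matrix:
    """Pointwise (1x1) convolution.

    x:      [C_in][N]      one row per input channel (here: one per time step)
    kernel: [C_out][C_in]  one weight vector per filter
    bias:   [C_out]
    returns [C_out][N]
    """
    n_nodes = len(x[0])
    return [
        [b + sum(w * x[c][j] for c, w in enumerate(filt)) for j in range(n_nodes)]
        for filt, b in zip(kernel, bias)
    ]


def temporal_correlation(seg_a: Matrix, seg_b: Matrix, kernel: Matrix, bias: List[float]) -> Matrix:
    """Concatenate two adjacent segments on the time axis, then mix them with a 1x1 conv."""
    return conv1x1(seg_a + seg_b, kernel, bias)


# n = 1 time step per segment, two nodes, a single averaging filter.
first_half = [[1.0, 2.0]]
second_half = [[3.0, 4.0]]
corr = temporal_correlation(first_half, second_half, kernel=[[0.5, 0.5]], bias=[0.0])
```

With a $1\times 1$ kernel each output value depends only on the same node across the concatenated time steps, so the learned filter weights express how strongly the two adjacent segments relate to each other.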

Fusion mechanism
This module is mainly composed of attention mechanisms, and its function is to combine the temporal correlation with the historical traffic flow data. The module consists of two parts, cross attention and data fusion. The structure of cross attention is shown in Fig. 10. We take the combination of short-term temporal correlation and historical traffic flow data as an example, where the query subspace is $Q = Q_d \in \mathbb{R}^{H\times d_m}$, the key subspace is $K = K_d \in \mathbb{R}^{H\times d_m}$ and the value subspace is $V = V_d \in \mathbb{R}^{H\times d_m}$. The query matrix $W_q \in \mathbb{R}^{d_m\times d_m}$, key matrix $W_k \in \mathbb{R}^{d_m\times d_m}$ and value matrix $W_v \in \mathbb{R}^{d_m\times d_m}$ are responsible for converting the data into the corresponding query subspace $Q_d$, key subspace $K_d$ and value subspace $V_d$. $W^F_1$ and $W^F_2$ represent weight matrices, and LayerNorm refers to layer normalization, which transforms the inputs of the neurons in a layer to have the same mean and variance, thereby accelerating convergence. $D$ is the channel dimension of the data, and $h$ is the number of attention heads. The STTN captures the spatio-temporal dependencies of the traffic flow data in time slots $t_1\sim t_n$ on day $T^{d}_{w+1}$, which yields $X'^{d}_{w+1;t_1\sim t_n}$. The same process applies to the long-term temporal correlation cross attention module. In the calculation equation of the data fusion module, $W_s$ and $W_l$ are weight matrices, and $\bar{Y}^{s}_{w;t_1\sim t_n}$ and $\bar{Y}^{l}_{w;t_1\sim t_n}$ are the output results of short-term and long-term cross attention, respectively.
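The layer normalization mentioned above can be sketched for a single feature vector as follows. This is the textbook form without the learnable gain and bias that deep learning libraries usually add; the small constant `eps` is an assumption introduced to avoid division by zero.

```python
import math
from typing import List


def layer_norm(x: List[float], eps: float = 1e-5) -> List[float]:
    """Normalize one feature vector to zero mean and (approximately) unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


normalized = layer_norm([1.0, 2.0, 3.0, 4.0])
```

Each position is normalized using only its own features, so the statistics do not depend on the batch size, which matters for the small batches used in the experiments.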

Period module and prediction layer
In order to reduce the error caused by a single decision, a period module is proposed. The module uses the traffic flow data of days $T^{d}_{w}$, $T^{d+3}_{w}$ and $T^{d+6}_{w}$. It first splices the data on the time dimension to obtain $P'_{w} \in \mathbb{R}^{3H\times N}$, then extracts the spatio-temporal dependency of the data, and finally reduces the dimension through the convolutional neural network to obtain the final result $P_{w} \in \mathbb{R}^{H\times N}$ of the module, where $H$ represents the size of the predicted time, and $N$ represents the number of sensors.
Then the output result of the period module is used as the input data of the prediction layer, which is composed of two layers of convolution.The equation is as follows:

Experiment and result analysis
In this section, the experimental process is described in detail from the following aspects: datasets, baselines, evaluation metrics, hyperparameter setting, convergence analysis, performance comparison and ablation studies.We use traffic speed data as traffic flow information.

Datasets
Two real datasets: PeMSD7(M) and PeMS08, are used to evaluate the performance of LSTSC model.All the data is scaled to 0 to 1 with min-max normalization in the experiments, and the details of the datasets are shown in Table 1.
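The min-max normalization can be sketched as follows. Whether the minimum and maximum are computed on the training split only or on the whole dataset is not stated here, so the sketch simply takes them from the values it is given; the function names are hypothetical.

```python
from typing import List, Tuple


def min_max_scale(values: List[float]) -> Tuple[List[float], float, float]:
    """Scale values linearly to [0, 1]; also return the statistics used."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant series: map everything to 0 to avoid division by zero
        return [0.0 for _ in values], lo, hi
    return [(v - lo) / (hi - lo) for v in values], lo, hi


def min_max_invert(scaled: List[float], lo: float, hi: float) -> List[float]:
    """Map scaled values back to the original units, e.g. before computing MAE."""
    return [s * (hi - lo) + lo for s in scaled]


speeds = [10.0, 20.0, 30.0]
scaled, lo, hi = min_max_scale(speeds)
restored = min_max_invert(scaled, lo, hi)
```

Keeping `lo` and `hi` alongside the scaled data allows predictions to be mapped back to the original speed units for evaluation.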

Baselines
The following provides a description of the baseline algorithms that are compared with the LSTSC model.
• FC-LSTM: As LSTM only considers the time series and does not take into account the spatial correlation between them, FC-LSTM is an improvement of the LSTM model by adding an attention mechanism, where the input of each gate is determined by three parts.
• DCRNN 26: DCRNN introduces diffusion convolution as graph convolution to capture spatial dependency, and uses a sequence-to-sequence architecture combined with GRU to capture temporal dependency.
• STGCN 27: STGCN introduces the graph neural network into the prediction of spatio-temporal series to effectively extract the spatio-temporal dependency.
• GWNet 28: GWNet includes two components: an adaptive dependency matrix, which is used to extract spatial dependency, and stacked dilated 1D convolution, which is used to extract temporal dependency.

Evaluation metrics
The evaluation metrics of the LSTSC model are the same as in previous work 23, including mean absolute error (MAE), root mean square error (RMSE) and mean absolute percentage error (MAPE). The equations are as follows:
$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right|, \qquad \mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}, \qquad \mathrm{MAPE} = \frac{100\%}{n}\sum_{i=1}^{n}\left|\frac{y_i - \hat{y}_i}{y_i}\right|,$$
where $y_i$ represents the actual value at a certain moment in time slots $t_{n+1}\sim t_{2n}$ on day $T^{d}_{w+1}$, $\hat{y}_i$ represents the corresponding predicted value, and $n$ represents the size of the predicted time. These three metrics are selected because MAE and RMSE directly reflect the magnitude of the prediction error in the units of the data, while MAPE expresses the error relative to the actual value: the smaller its value, the better the fit and the accuracy of the prediction model.
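The three metrics can be computed with a few lines of plain Python. The sketch uses the standard textbook definitions and reports MAPE in percent; it assumes that no actual value is zero, since zero readings would have to be masked before computing MAPE.

```python
import math
from typing import List


def mae(y: List[float], y_hat: List[float]) -> float:
    """Mean absolute error."""
    return sum(abs(a - b) for a, b in zip(y, y_hat)) / len(y)


def rmse(y: List[float], y_hat: List[float]) -> float:
    """Root mean square error; penalizes large deviations more strongly than MAE."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y, y_hat)) / len(y))


def mape(y: List[float], y_hat: List[float]) -> float:
    """Mean absolute percentage error in percent; assumes every actual value is non-zero."""
    return 100.0 * sum(abs((a - b) / a) for a, b in zip(y, y_hat)) / len(y)


actual = [100.0, 200.0]
predicted = [110.0, 180.0]
scores = (mae(actual, predicted), rmse(actual, predicted), mape(actual, predicted))
```

In this toy case the absolute errors are 10 and 20, so the MAE is 15 and the MAPE is 10%, while the RMSE (the square root of 250) is slightly larger than the MAE because it weights the larger error more heavily.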

Parameter settings
Table 2 describes the parameters of LSTSC in the experiment. We use 12 historical time steps to predict the next 12 time steps in the future. The CNN module, designed to extract temporal correlation, consists of a one-layer CNN with 12 filters, a stride of 1, a padding size of 0, and a convolution kernel size of 1 × 1. The number of heads for multi-head attention in the experiment is uniformly set to 2. The CNN module used in the prediction layer is a two-layer CNN, with the number of filters set to 12 and 1 respectively, a stride of 1, a padding size of 0, and a convolution kernel size of 1 × 1. LSTSC is optimized by the Adam optimizer, and the batch size of the experiment is set to 16.

Hyperparametric studies
In this section, we investigate the influence of the dimension α of the feed-forward network, which belongs to the multi-head attention mechanism, on the results of traffic flow prediction. We study the prediction results when α is 1, 2, 3 and 4. As shown in Table 3 (the best results are indicated in bold), the best experimental results for the PeMSD7(M) dataset were achieved when α = 2. On the PeMS08 dataset, the model achieved the best MAE at 15 min and 30 min when α = 4, and at 60 min when α = 2. For the MAPE metric, the model achieved the best results at 15 min and 30 min when α = 3, and at 60 min when α = 2. For the RMSE metric, the model achieved the best results at 15 min, 30 min and 60 min when α = 2. Therefore, for long-term traffic flow forecasting, an α value of 2 may be used.

Experimental results and analysis
The Highway Capacity Manual 29 recommends using 15 min as the short-term prediction interval for research and analysis purposes 30. Table 4 describes the results of the LSTSC model and the baseline algorithms on PeMSD7(M) and PeMS08. For the ablation experiment on $\bar{Y}^{s}_{w;t_1\sim t_n}$, in Eq. (20) we set $W_s$ to a zero matrix while $W_l$ remains a learnable parameter matrix. As a result, the contribution of $\bar{Y}^{s}_{w;t_1\sim t_n}$ to the model output is eliminated, and the importance of $\bar{Y}^{s}_{w;t_1\sim t_n}$ can be assessed by comparing the model performance before and after ablation. A similar operation is performed for the ablation experiment on $\bar{Y}^{l}_{w;t_1\sim t_n}$, where $W_l$ is set to a zero matrix and $W_s$ remains a learnable parameter matrix. As a result, the contribution of $\bar{Y}^{l}_{w;t_1\sim t_n}$ to the model output is eliminated. We chose this method because it ablates an input feature without changing the model structure, merely by modifying the weight matrices. Setting the weight matrix of a specific input feature to a zero matrix completely eliminates the contribution of that feature to the model output, which allows the importance of the feature to be assessed. Additionally, since the result of multiplying any matrix by a zero matrix is still a zero matrix, this method is also computationally efficient.
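The zero-weight ablation can be illustrated with a minimal sketch. The fusion is assumed here to be an element-wise weighted sum of the short-term and long-term branch outputs; the exact form of the fusion equation (for instance, matrix multiplication instead of element-wise weighting) is an assumption, and the names `fuse` and `zeros_like` are hypothetical.

```python
from typing import List


def fuse(y_short: List[float], y_long: List[float],
         w_s: List[float], w_l: List[float]) -> List[float]:
    """Assumed fusion form: out = w_s * y_short + w_l * y_long (element-wise)."""
    return [ws * a + wl * b for a, b, ws, wl in zip(y_short, y_long, w_s, w_l)]


def zeros_like(w: List[float]) -> List[float]:
    return [0.0 for _ in w]


y_short, y_long = [1.0, 2.0], [10.0, 20.0]
w_s, w_l = [0.5, 0.5], [1.0, 1.0]

full = fuse(y_short, y_long, w_s, w_l)                  # both branches contribute
no_short = fuse(y_short, y_long, zeros_like(w_s), w_l)  # short-term branch ablated
no_long = fuse(y_short, y_long, w_s, zeros_like(w_l))   # long-term branch ablated
```

With the short-term weights zeroed, the fused output no longer depends on `y_short` at all, so any drop in accuracy relative to the full model can be attributed to the removed branch while the rest of the architecture stays unchanged.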
Table 5 describes the results of the LSTSC model and its variants on the PeMSD7(M) and PeMS08 datasets. On both datasets, the LSTSC model performs better than the LSTSC_NoLong, LSTSC_NoShort and LSTSC_NoPeriod models for both short-term and long-term traffic flow prediction, which demonstrates the effectiveness of long-term temporal correlation, short-term temporal correlation and period, respectively. For the PeMSD7(M) dataset, the experimental results of LSTSC_NoShort are better than those of LSTSC_NoLong and LSTSC_NoPeriod, indicating that short-term temporal correlation has a lower weight than long-term temporal correlation and period, while the experimental results of LSTSC_NoPeriod are worse than those of LSTSC_NoLong, indicating that the weight of period is higher than that of long-term temporal correlation. For the PeMS08 dataset, the experimental results based on the MAE, MAPE and RMSE metrics reflect the same conclusions: short-term temporal correlation has a lower weight than long-term temporal correlation and periodicity, and periodicity has a higher weight than long-term temporal correlation.
Due to the inherent periodicity in natural phenomena, traffic flow might exhibit cyclic patterns, with traffic patterns recurring on a weekly basis, for instance.Consequently, long-term temporal correlation could be more pronounced compared to short-term temporal correlation.In other words, traffic patterns may tend to repeat over longer time scales, such as a week, leading to stronger correlations in the long-term compared to short-term correlations.For example, let's consider a major urban freeway that experiences heavy traffic during weekdays due to work commutes, resulting in a daily traffic pattern.However, on weekends, the traffic flow on the same freeway might decrease significantly, leading to a different traffic pattern.Over time, this daily pattern may not be as consistent as the weekly pattern, where traffic flow experiences regular fluctuations during weekdays and weekends.The long-term temporal correlation, in this case, would capture the recurrent weekly pattern, while the short-term temporal correlation would mainly reflect the daily fluctuations.
In general, the long-term temporal correlation module, short-term temporal correlation module and period module can effectively improve the traffic flow prediction performance of the model.

Conclusion
In order to strengthen the capture of temporal correlation and effectively solve the dynamic spatial dependency and long-term temporal dependency in traffic flow prediction, we propose a multi-modal attention neural network for traffic flow prediction.In this model, an attention mechanism is designed to address the limited temporal dependency and fixed spatial dependency problems of the data.At the same time, CNNs are used to enhance the capture of temporal correlation in traffic data, and a fusion mechanism is designed to obtain the prediction results.In addition, we also design a multimodal attention neural network to solve the problem of single decision-making in the model.Finally, various experiments were conducted on two real-world datasets, and the results show that the performance of the proposed model in long-term traffic flow prediction is better than that of baseline algorithms.
• PeMSD7(M) The traffic speed dataset is collected by the California Department of Transportation in the seventh district of California through 228 road traffic sensors, and the collected data samples are aggregated every 5 min. The dataset records the vehicle speed of the seventh district of California from May 1, 2012 to June 30, 2012.
• PeMS08 The traffic speed dataset is collected by the California Department of Transportation through 170 road traffic sensors, and the collected data samples are aggregated every 5 min. The dataset records the vehicle speed of San Bernardino, California, from July 1, 2016 to August 31, 2016.

Figure 11
Figure 11 shows the loss curves of the LSTSC model on the training and validation sets of the two real datasets. As shown in Fig. 11(a), on the PeMSD7(M) dataset the MAE of both the training set and the validation set gradually decreases as the number of training iterations increases, and begins to stabilize at around 65 iterations. As shown in Fig. 11(b), on the PeMS08 dataset the MAE of both the training set and the validation set likewise gradually decreases as the number of training iterations increases, and begins to stabilize at around 128 iterations.

Table 2. Hyperparameter settings for the model.

Table 3. The traffic flow prediction results with the change of the parameters.